{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Sliceguard – Find critical data segments in your data (fast)\n",
    "## Unstructured Data Walkthrough"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🕹️ **[Interactive Demo](https://huggingface.co/spaces/renumics/sliceguard-unstructured-data)**\n",
    "\n",
    "Sliceguard is a python library for quickly finding **critical data slices** like outliers, errors, or biases. It works on **structured** and **unstructured** data.\n",
    "\n",
    "This notebook showcases especially the **unstructured** data case, so if you have tabular data instead have a look at **[guide for structured data](./examples/quickstart_structured_data.ipynb)** instead\n",
    "\n",
    "It is interesting for you if you want to do the following:\n",
    "1. Find **performance issues** of your machine learning model.\n",
    "2. Find **anomalies and inconsistencies** in your data.\n",
    "3. Quickly **explore** your data using an interactive report to generate **insights**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this notebook install and import sliceguard:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U sliceguard[all]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sliceguard import SliceGuard\n",
    "from sliceguard.data import from_huggingface\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.svm import SVC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now download the demo dataset with our utility function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = from_huggingface(\"Matthijs/snacks\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You now have the following dataframe containing an image column with a path to the raw image on the harddrive, a label and a split marker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>image</th>\n",
       "      <th>label</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>apple</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>apple</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>apple</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>apple</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>apple</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>950</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>watermelon</td>\n",
       "      <td>validation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>951</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>watermelon</td>\n",
       "      <td>validation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>952</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>watermelon</td>\n",
       "      <td>validation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>953</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>watermelon</td>\n",
       "      <td>validation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>954</th>\n",
       "      <td>/home/daniel/.cache/huggingface/datasets/downl...</td>\n",
       "      <td>watermelon</td>\n",
       "      <td>validation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6745 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 image       label       split\n",
       "0    /home/daniel/.cache/huggingface/datasets/downl...       apple       train\n",
       "1    /home/daniel/.cache/huggingface/datasets/downl...       apple       train\n",
       "2    /home/daniel/.cache/huggingface/datasets/downl...       apple       train\n",
       "3    /home/daniel/.cache/huggingface/datasets/downl...       apple       train\n",
       "4    /home/daniel/.cache/huggingface/datasets/downl...       apple       train\n",
       "..                                                 ...         ...         ...\n",
       "950  /home/daniel/.cache/huggingface/datasets/downl...  watermelon  validation\n",
       "951  /home/daniel/.cache/huggingface/datasets/downl...  watermelon  validation\n",
       "952  /home/daniel/.cache/huggingface/datasets/downl...  watermelon  validation\n",
       "953  /home/daniel/.cache/huggingface/datasets/downl...  watermelon  validation\n",
       "954  /home/daniel/.cache/huggingface/datasets/downl...  watermelon  validation\n",
       "\n",
       "[6745 rows x 3 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for data slices that are particulary different (Outliers/Errors in the data)\n",
    "Here sliceguard will train an **outlier detection** model to highlight data points that are especially different from the rest. Note that we here run the outlier detection only on the class popcorn. You will easily see there is a bunch of weird cases where the popcorn is a popcorn package or people are the most salient part of the image.\n",
    "\n",
    "You can use the **report feature** that uses [Renumics Spotlight](https://github.com/Renumics/spotlight) for visualization to dig into the reasons why a cluster is considered an outlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df[df[\"label\"] == \"popcorn\"], features=[\"image\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for data slices where models are prone to fail (hard samples, inconsistencies)\n",
    "Here sliceguard will **train a classification model** and check for data slices where the accuracy score is particulary bad. You will realize that this will show you a whole bunch of **ambiguous images** in this dataset, potentially containing both classes. There are also **mislabeled** examples and images that are just **challenging** for the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df, features=[\"image\"], y=\"label\", metric=accuracy_score, automl_task=\"classification\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_embeddings = sg.embeddings[\"image\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for weaknesses of your own model (...and hard samples + inconsistencies)\n",
    "This shows how to pass your **own model predictions** into sliceguard to find slices that are performing badly according to a supplied metric function. This allows you to uncover **inconsistencies** and samples that are **hard to learn** in no time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the model and predict on the same data (of course in practice you will want to split your data!!!)\n",
    "# This is only for showing the principle\n",
    "clf = SVC()\n",
    "clf.fit(image_embeddings, df[\"label\"])\n",
    "df[\"predictions\"] = clf.predict(image_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pass the predictions to sliceguard and uncover hard samples and inconsistencies.\n",
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df, features=[\"image\"], y=\"label\", y_pred=\"predictions\", metric=accuracy_score, precomputed_embeddings={\"image\": image_embeddings})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
